LINEAR FEATURE SELECTION WITH APPLICATIONS


ABSTRACT. This paper selectively surveys contributions in linear feature 
selection which have been developed for the analysis of multipass LANDSAT 
data in conjunction with the Large Area Crop Inventory Experiment. Most 
of the results surveyed have been obtained since early 1973 and have 
applications outside of satellite remote sensing. A few of the theoretical 
results and associated computational techniques have appeared either in 
journal articles or in proceedings of technical symposia. However, most 
of these contributions appear only in scattered contract reports and are 
not generally known by the scientific community. 

Pattern recognition; linear feature selection; LANDSAT data; crop classification; sufficient statistics

INTRODUCTION 

The Large Area Crop Inventory Experiment (LACIE) is concerned with
the use of satellite-acquired (LANDSAT) multispectral scanner (MSS) data
to conduct an inventory of some crop of economic interest, such as wheat,
over a large geographical area. Such an inventory requires the development
of accurate and efficient algorithms for data classification. The use of
multitemporal measurements (several registered passes during the growing
season) increases the dimension of the original measurement space (pattern
space), thereby increasing the computational load in classification procedures.
In this connection, the cost of using statistical pattern classification
algorithms depends, to a large extent, upon reducing the dimensionality
of the problem by use of feature selection/combination techniques. These
techniques are employed to find a subspace of reduced dimension (feature
space) in which to perform classification while attempting to maintain
the level of classification accuracy obtainable in the original measurement
space. The most meaningful performance criterion that can be applied
to a classification algorithm is the frequency with which it misclassifies
observations; that is, the probability of misclassification. Consequently,
one should attempt to select/combine features in such a way that the
probability of misclassification in feature space is minimized.

In the sequel we discuss several ways feature selection techniques
have been used in the LACIE. In all cases the techniques require some a
priori information and assumptions (e.g. number of classes, form of
conditional class density functions) about the structure of the data. In
most cases the classification procedure (e.g. Bayes optimal) has been
chosen in advance. Dimensionality reduction is then performed so as to
(1) choose an optimal feature space in which to perform classification,
and (2) determine a transformation to apply to measurement vectors prior
to classification. In all that follows the transformations used for
dimensionality reduction are linear; that is, the variables in feature
space are always linear combinations of the original measurements.

As mentioned above, the most meaningful performance criterion for
a classification procedure is the probability of misclassification
(denoted in the sequel by G). However, if the dimension of feature
space (and therefore measurement space) is greater than one, then G
is difficult to compute without additional class structure assumptions
(e.g. equal covariance matrices). As a result, several numerically tractable
criteria have been developed in conjunction with the LACIE which provide
some information concerning the behavior of G. These criteria are
discussed in the next section. In a subsequent section we present a
compendium of recent results on linear feature selection techniques, most
of which are available only in scattered NASA contract reports. In the
final section we discuss the use of these techniques in the LACIE, outline
some of the investigations underway in the use of linear feature
selection techniques, and discuss some related open questions.

MATHEMATICAL PRELIMINARIES 

Let $\Pi_1, \Pi_2, \ldots, \Pi_m$ be distinct classes (e.g. crops of interest)
with known a priori probabilities $\alpha_1, \alpha_2, \ldots, \alpha_m$, respectively. Let
$x = (x_1, x_2, \ldots, x_n)^T \in R^n$ denote a feature vector of measurements (e.g.
LANDSAT multispectral scanner data from either a single pass or several
registered passes) taken from an arbitrary element of $\bigcup_{i=1}^{m} \Pi_i$. Suppose
that the measurement vectors for class $\Pi_i$ are characterized by the
n-dimensional multivariate normal density function

$$p_i(x) = (2\pi)^{-n/2} |\Sigma_i|^{-1/2} \exp\left[-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right], \quad 1 \le i \le m.$$

We assume that the $n \times 1$ mean vector $\mu_i$ and the $n \times n$ covariance matrix
$\Sigma_i$ for each class $\Pi_i$ are known (with $\Sigma_i$ positive definite),
$1 \le i \le m$. The symbol $|A|$ is used to denote the determinant of the
matrix $A$. The n-dimensional probability of misclassification, denoted
by $G$, of objects from $\bigcup_{i=1}^{m} \Pi_i$ is given (see Anderson^(1) and Andrews^(2))
by

$$G = 1 - \int_{R^n} \max_{1 \le i \le m} \alpha_i p_i(x)\,dx = 1 - \sum_{i=1}^{m} \alpha_i \int_{R_i} p_i(x)\,dx,$$

where the sets $R_i$, $1 \le i \le m$, called the Bayes' decision regions,
are defined by

$$R_i = \{x \in R^n : \alpha_i p_i(x) = \max_{1 \le j \le m} \alpha_j p_j(x)\}, \quad 1 \le i \le m.$$

The resulting classification procedure, called the Bayes' optimal classifier,
is defined as follows (see Anderson^(1)):

Assign an element to $\Pi_i$ if its vector $x$ of measurements
belongs to $R_i$, $1 \le i \le m$.
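As a concrete illustration of the rule above, the following minimal sketch assigns a two-dimensional measurement to the class maximizing $\alpha_i p_i(x)$. The function names (`log_gauss2`, `bayes_assign`) and the restriction to n = 2 are assumptions made only to keep the example self-contained; they do not come from the LACIE software.

```python
import math

def log_gauss2(x, mu, cov):
    """Log of the bivariate normal density with mean mu and 2 x 2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu), using the explicit 2 x 2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def bayes_assign(x, priors, means, covs):
    """Index i maximizing alpha_i * p_i(x); ties go to the lowest index."""
    scores = [math.log(a) + log_gauss2(x, m, c)
              for a, m, c in zip(priors, means, covs)]
    return max(range(len(scores)), key=scores.__getitem__)

identity = ((1.0, 0.0), (0.0, 1.0))
means = [(0.0, 0.0), (4.0, 0.0)]
print(bayes_assign((0.5, 0.0), [0.5, 0.5], means, [identity, identity]))
print(bayes_assign((2.1, 0.0), [0.9, 0.1], means, [identity, identity]))
```

Working with log densities avoids underflow far from the class means; the second call shows how unequal a priori probabilities shift the decision boundary.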

The Bhattacharyya coefficient for classes i and j ($1 \le i, j \le m$)
is given (see Kailath^(3)) by

$$\rho(i,j) = \int_{R^n} \{p_i(x)\,p_j(x)\}^{1/2}\,dx.$$

It has been shown that

$$G \le \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \rho(i,j) = \rho.$$

The quantity $\rho$ is usually called the Bhattacharyya distance (or the
average Bhattacharyya distance).

There have been various attempts to utilize certain functions of
$\rho(i,j)$ and $\rho$ to generate Bhattacharyya related separability measures.
We refer the reader to the general bibliography and Kanal^(4) for further
variations on this theme.
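For two univariate normal classes the Bhattacharyya coefficient has a well-known closed form, which makes the definition easy to check numerically. The sketch below is illustrative only: the helper names, the midpoint-rule integrator and the integration limits are assumptions, and the univariate setting is chosen so that no matrix code is needed.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bhattacharyya_closed_form(mu1, var1, mu2, var2):
    """Closed form of the integral of sqrt(p1 * p2) for two univariate normals."""
    s = var1 + var2
    return math.sqrt(2.0 * math.sqrt(var1 * var2) / s) * math.exp(-(mu1 - mu2) ** 2 / (4.0 * s))

def midpoint_integral(f, lo, hi, steps=20000):
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def bhattacharyya_numeric(mu1, var1, mu2, var2):
    f = lambda x: math.sqrt(normal_pdf(x, mu1, var1) * normal_pdf(x, mu2, var2))
    return midpoint_integral(f, -30.0, 30.0)

def bayes_error_two_class(a1, mu1, var1, a2, mu2, var2):
    """G for two classes: the integral of min(a1 * p1, a2 * p2)."""
    f = lambda x: min(a1 * normal_pdf(x, mu1, var1), a2 * normal_pdf(x, mu2, var2))
    return midpoint_integral(f, -30.0, 30.0)

rho = bhattacharyya_closed_form(0.0, 1.0, 2.0, 1.0)
G = bayes_error_two_class(0.5, 0.0, 1.0, 0.5, 2.0, 1.0)
print(G, rho)   # G should not exceed rho
```

For two classes the identity 1 - max = min (after integrating the mixture to one) is what lets `bayes_error_two_class` integrate the pointwise minimum directly.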

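The divergence (see Kullback^(5)) between classes i and j
($1 \le i, j \le m$) is given by

$$D(i,j) = \tfrac{1}{2}\,\mathrm{tr}\left[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})\right] + \tfrac{1}{2}\,\mathrm{tr}\left[(\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i - \mu_j)(\mu_i - \mu_j)^T\right],$$

and the average interclass divergence is given by

$$D = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} D(i,j)$$

or, equivalently, as shown in Decell and Quirein^(6), by

$$D = \tfrac{1}{2}\,\mathrm{tr}\left[\sum_{i=1}^{m} \Sigma_i^{-1} S_i\right] - \frac{n\,m(m-1)}{2},$$

where

$$S_i = \sum_{j=1,\ j \ne i}^{m} \left(\Sigma_j + \delta_{ij}\delta_{ij}^T\right) \quad \text{and} \quad \delta_{ij} = \mu_i - \mu_j.$$

As in the case of $\rho(i,j)$ and $\rho$, various functions of $D(i,j)$ and
$D$ have been proposed as class separability measures.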
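The equivalence between the pairwise sum and the trace form is easiest to see in the scalar case n = 1, where each covariance matrix is a variance and the trace is the number itself. The following sketch (function names are illustrative assumptions) evaluates both forms for three univariate classes.

```python
def divergence(mu_i, var_i, mu_j, var_j):
    """D(i, j) for two univariate normal classes (n = 1)."""
    d2 = (mu_i - mu_j) ** 2
    return (0.5 * (var_i - var_j) * (1.0 / var_j - 1.0 / var_i)
            + 0.5 * (1.0 / var_i + 1.0 / var_j) * d2)

def interclass_divergence_pairwise(mus, variances):
    """Sum of D(i, j) over all pairs i < j."""
    m = len(mus)
    return sum(divergence(mus[i], variances[i], mus[j], variances[j])
               for i in range(m - 1) for j in range(i + 1, m))

def interclass_divergence_trace_form(mus, variances):
    """(1/2) sum_i S_i / var_i - m(m-1)/2, the n = 1 case of the trace form."""
    m = len(mus)
    total = 0.0
    for i in range(m):
        s_i = sum(variances[j] + (mus[i] - mus[j]) ** 2 for j in range(m) if j != i)
        total += s_i / variances[i]
    return 0.5 * total - m * (m - 1) / 2.0

mus, variances = [0.0, 1.5, 4.0], [1.0, 2.0, 0.5]
print(interclass_divergence_pairwise(mus, variances))
print(interclass_divergence_trace_form(mus, variances))
```

The trace form needs only one pass over the classes once the $S_i$ are accumulated, which is the practical reason for preferring it when m is large.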

Kanal provides an excellent exposition of such measures (e.g.
Shannon entropy, Vajda's average conditional quadratic entropy, Devijver's
Bayesian distance, Minkowski measures of nonuniformity, Bhattacharyya
bound, Chernoff bound, Kolmogorov variational distance, Devijver's and
Lissack and Fu's generalizations of the latter, Ito's approximating
functions, and the Jeffreys-Matusita distance). This work contains 304
references and is perhaps the only comprehensive exposition of the
subject through early 1974. A more recent nonparametric separability
measure due to Bryant and Guseman^(7) will be outlined at the end of this
section.

Devijver^(8) develops a bound on $G$ called the Bayesian distance.
He gives an excellent development of the concept and its relationship
to the aforementioned separability measures. His results are quite
general with regard to the class densities $p_i(x)$ and class a priori
probabilities $\alpha_i$, $1 \le i \le m$. The Bayesian distance is defined
to be

$$H = \sum_{i=1}^{m} E\left[\left(\frac{\alpha_i\,p_i(x)}{p(x)}\right)^2\right],$$

where $p(x) = \sum_{i=1}^{m} \alpha_i p_i(x)$ and the expectation is taken with respect to $p(x)$.

The measure $H$ satisfies the inequality:

$$H \le 1 - G \le \frac{1}{m} + \frac{m-1}{m}\sqrt{\frac{mH-1}{m-1}}.$$

Following the philosophy discussed in the introduction, the intractable
nature of the expression for G (while in many instances unnecessary, we
are restricting our attention to a finite family of normally distributed
pattern classes) was one, if not the single, reason for developing more
tractable pattern class separability measures. These measures could then
be used in lieu of G to determine mappings from pattern space to
feature space in which the classification of patterns is equivalent to
(G is preserved) or "nearly equivalent to" classification of patterns in
pattern space. Two fundamental questions arise. First, what
(if any) relation do the class separability measures bear to G?
Second, can one develop tractable algorithms based on the separability
measures to determine the dimension reducing mappings?

In connection with these questions we will only consider linear onto
mappings $B$ of the measurement space $R^n$ to $R^k$ for $k < n$. This is
equivalent to requiring that $B$ be a $k \times n$ rank $k$ matrix. This class
of mappings certainly includes those of the "feature subset selection"
type since the selection of any k-feature subset (i.e. any $k$ components
of $x \in R^n$) can be accomplished by selecting the appropriate $k \times n$ matrix
$B$ consisting of only 0's and 1's. The class of $k \times n$ rank $k$ matrices
is more general in the sense that linear combinations of the features
are permissible.
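The remark that feature subset selection is a special case of a linear map can be made concrete in a few lines. In this sketch the helper names are hypothetical and the indices are zero-based.

```python
def subset_matrix(indices, n):
    """k x n matrix of 0's and 1's that picks the listed components of an n-vector."""
    return [[1 if col == idx else 0 for col in range(n)] for idx in indices]

def apply(B, x):
    """Compute y = B x for a matrix given as a list of rows."""
    return [sum(b * xi for b, xi in zip(row, x)) for row in B]

B = subset_matrix([0, 2], 4)
print(B)                                   # the 2 x 4 selection matrix
print(apply(B, [10.0, 20.0, 30.0, 40.0]))  # components 0 and 2 of x
```

Replacing any 0 or 1 in `B` by an arbitrary real number gives a general linear combination of the original measurements, which is the larger class of mappings considered in the text.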

In all that follows we will assume that $B$ is a $k \times n$ rank $k$
matrix and that $X(\omega) = x$ is a normally distributed random variable. It
is well known that if $X \sim N(\mu, \Sigma)$ then $Y = BX \sim N(B\mu, B\Sigma B^T)$.

The transformed measurements $y = Bx$ for class $\Pi_i$ are normally
distributed with density function

$$p_i(y,B) = (2\pi)^{-k/2}\,|B\Sigma_i B^T|^{-1/2} \exp\left[-\tfrac{1}{2}(y - B\mu_i)^T (B\Sigma_i B^T)^{-1} (y - B\mu_i)\right], \quad 1 \le i \le m,$$

and the resulting probability of misclassification is given by

$$G(B) = 1 - \int_{R^k} \max_{1 \le i \le m} \alpha_i\,p_i(y,B)\,dy = 1 - \sum_{i=1}^{m} \alpha_i \int_{R_i(B)} p_i(y,B)\,dy,$$

where the transformed Bayes' decision regions are given by

$$R_i(B) = \{y \in R^k : \alpha_i\,p_i(y,B) = \max_{1 \le j \le m} \alpha_j\,p_j(y,B)\}, \quad 1 \le i \le m.$$
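For a one-dimensional feature space (k = 1) the quantity G(B) can be evaluated by direct numerical integration of the formula above. The sketch below is only an illustration: the function names, the midpoint rule and the integration window of ten standard deviations are assumptions, not the computational procedure of the cited reports.

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def project(B, mu, cov):
    """Mean B mu and variance B Sigma B^T of y = B x for a 1 x n vector B."""
    n = len(B)
    mean = sum(B[r] * mu[r] for r in range(n))
    var = sum(B[r] * cov[r][c] * B[c] for r in range(n) for c in range(n))
    return mean, var

def misclassification_1d(B, priors, means, covs, steps=40000):
    """G(B) = 1 - integral of max_i alpha_i p_i(y, B), by the midpoint rule."""
    params = [project(B, mu, cov) for mu, cov in zip(means, covs)]
    lo = min(mean - 10.0 * math.sqrt(var) for mean, var in params)
    hi = max(mean + 10.0 * math.sqrt(var) for mean, var in params)
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        y = lo + (s + 0.5) * h
        total += max(a * normal_pdf(y, mean, var)
                     for a, (mean, var) in zip(priors, params))
    return 1.0 - h * total

identity = [[1.0, 0.0], [0.0, 1.0]]
means = [[0.0, 0.0], [3.0, 0.0]]
print(misclassification_1d([1.0, 0.0], [0.5, 0.5], means, [identity, identity]))
print(misclassification_1d([0.0, 1.0], [0.5, 0.5], means, [identity, identity]))
```

The first direction keeps all of the class separation, while the second projects both classes onto the same distribution, so with equal a priori probabilities its error is one half.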


The B-Bhattacharyya coefficient for classes i and j is given by

$$\rho_B(i,j) = \int_{R^k} \{p_i(y,B)\,p_j(y,B)\}^{1/2}\,dy.$$

It has been shown by Decell and Quirein^(6) that for each $B$

$$G \le G(B) \le \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \rho_B(i,j) = \rho(B).$$

The quantity $\rho(B)$ is called the B-Bhattacharyya distance or the
B-average Bhattacharyya distance.

In addition, it has been shown by Decell and Quirein^(6) that $G = G(B)$
if and only if $\rho = \rho(B)$.

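The B-divergence between classes i and j ($1 \le i, j \le m$) is

$$D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\left[(B\Sigma_i B^T - B\Sigma_j B^T)\left((B\Sigma_j B^T)^{-1} - (B\Sigma_i B^T)^{-1}\right)\right] + \tfrac{1}{2}\,\mathrm{tr}\left[\left((B\Sigma_i B^T)^{-1} + (B\Sigma_j B^T)^{-1}\right)(B\mu_i - B\mu_j)(B\mu_i - B\mu_j)^T\right]$$

and the B-interclass divergence is

$$D(B) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} D_B(i,j)$$

or, equivalently (see Decell and Quirein^(6)),

$$D(B) = \tfrac{1}{2}\,\mathrm{tr}\left[\sum_{i=1}^{m} (B\Sigma_i B^T)^{-1}(B S_i B^T)\right] - \frac{k\,m(m-1)}{2},$$

where

$$S_i = \sum_{j=1,\ j \ne i}^{m} \left(\Sigma_j + \delta_{ij}\delta_{ij}^T\right) \quad \text{and} \quad \delta_{ij} = \mu_i - \mu_j.$$

While there is no explicit relationship between $G$ and $D$ (or $G(B)$ and
$D(B)$), it was shown by Decell and Quirein^(6) that $D = D(B)$ if and only
if $G = G(B)$.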


In the present setting and with the obvious general meaning of the
definition we define the B-Bayesian distance to be

$$H(B) = \sum_{i=1}^{m} E\left[\left(\frac{\alpha_i\,p_i(y,B)}{p(y,B)}\right)^2\right],$$

where

$$p(y,B) = \sum_{i=1}^{m} \alpha_i\,p_i(y,B).$$

It has been shown in Guseman, Peters and Swasdee^(9) that $G(B) = G$
if and only if $H(B) = H$. In this connection, the authors of this paper
plan to extend the variational results of the next section to include
Bayesian distance.

In the next section we will outline related new results concerning,
among others, questions raised earlier and explain the connection between
linear feature combination and the classical concept of statistical
sufficiency.

RECENT RESULTS IN LINEAR FEATURE SELECTION

In what follows we will be concerned with finding an extreme value
of some function $\Phi$ (of the reduction matrix $B$). For example, we may
wish to choose $\Phi(B) = G(B)$ and find $\hat{B}$ such that $\Phi(\hat{B}) = \min_B G(B)$ or,
perhaps, choose $\Phi(B) = D(B)$ and find $\hat{B}$ such that $\Phi(\hat{B}) = \max_B D(B)$.

In seeking an extremum of $\Phi$, it is natural to consider the
differentiability of $\Phi$ with respect to the elements of $B$. In the
sequel we make use of the Gateaux differential of $\Phi$ at $B$ with increment
$C$, denoted by $\delta\Phi(B;C)$, and defined (if the limit exists) by

$$\delta\Phi(B;C) = \lim_{s \to 0} \frac{\Phi(B + sC) - \Phi(B)}{s},$$

where $C$ is a $k \times n$ matrix. If, for a given $k \times n$ matrix $B$ of rank
$k$, the above limit exists for each $k \times n$ matrix $C$, then $\Phi$ is said to
be Gateaux differentiable at $B$. Similarly we define (when the limit
exists)

$$\delta p_i(y,B;C) = \lim_{s \to 0} \frac{p_i(y, B + sC) - p_i(y,B)}{s},$$

where $C$ is a $k \times n$ matrix. For an excellent discussion of Gateaux
differentials see Luenberger^(10).
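A Gateaux differential can be approximated by evaluating the difference quotient in its definition at a small step. The sketch below does this for the simple criterion $\Phi(B) = B\Sigma B^T$ with a $1 \times n$ vector $B$, for which the exact differential is $2\,C\Sigma B^T$. The choice of $\Phi$, the step size and the function names are assumptions made for illustration.

```python
def quadratic_form(B, cov):
    """Phi(B) = B Sigma B^T for a 1 x n vector B given as a list."""
    n = len(B)
    return sum(B[r] * cov[r][c] * B[c] for r in range(n) for c in range(n))

def gateaux(phi, B, C, s=1e-6):
    """Difference quotient (phi(B + s C) - phi(B)) / s for a small step s."""
    shifted = [b + s * c for b, c in zip(B, C)]
    return (phi(shifted) - phi(B)) / s

cov = [[2.0, 0.5], [0.5, 1.0]]
B = [1.0, -2.0]
C = [0.3, 0.7]
phi = lambda row: quadratic_form(row, cov)
exact = 2.0 * sum(C[r] * cov[r][c] * B[c] for r in range(2) for c in range(2))
print(gateaux(phi, B, C), exact)
```

Taking `C` to be each coordinate vector in turn recovers the ordinary partial derivatives with respect to the entries of `B`.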

Theoretical results related to minimizing $G(B)$ for two multivariate
normal classes with equal a priori probabilities and a one-dimensional
feature space were initially presented by Guseman and Walker^(11),(12).
The associated computational procedure was presented by Guseman and Walker^(13).
The following results for the general case of $m$ n-dimensional normal
classes with arbitrary a priori probabilities and a one-dimensional feature
space appear in Guseman, Peters, and Walker^(14).


LEMMA. Let $B$ be a nonzero $1 \times n$ vector. Then (omitting subscripts)

$$\delta p(y,B;C) = -p(y,B)\left[\frac{C\Sigma B^T}{B\Sigma B^T} - \frac{(y - B\mu)\,C\mu}{B\Sigma B^T} - \frac{(y - B\mu)^2\,C\Sigma B^T}{(B\Sigma B^T)^2}\right]$$

for each $1 \times n$ vector $C$.

THEOREM. Let $B$ be a nonzero $1 \times n$ vector for which $\alpha_i p_i(y,B) \ne \alpha_j p_j(y,B)$
for $i \ne j$. Then $G$ is Gateaux differentiable at $B$, and

$$\delta G(B;C) = -\sum_{i=1}^{m} \alpha_i \int_{R_i(B)} \delta p_i(y,B;C)\,dy.$$

THEOREM. Let $B$ be a nonzero $1 \times n$ vector at which $G$ assumes a
minimum. Then $G$ is Gateaux differentiable at $B$.

By substituting the expression for $\delta p_i(y,B;C)$ given by the LEMMA
into the expression from the first THEOREM, and using integration by parts, we
obtain the following result.

THEOREM. Let $B$ be a nonzero $1 \times n$ vector for which $\alpha_i p_i(y,B) \ne \alpha_j p_j(y,B)$
for $i \ne j$. Then $G$ is Gateaux differentiable at $B$, and

$$\delta G(B;C) = \sum_{i=1}^{m} \alpha_i \left\{\left[\frac{C\Sigma_i B^T}{B\Sigma_i B^T}\,(y - B\mu_i) + C\mu_i\right] p_i(y,B)\right\}\Bigg|_{R_i(B)},$$

where the notation $f(y)\big|_{R_i(B)}$ denotes the sum of the values of the function $f$
at the right endpoints of the intervals comprising $R_i(B)$ minus the
sum of its values at the left endpoints.
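The differential of the projected density for a $1 \times n$ vector $B$ can be checked against a finite difference. The sketch below codes the expression $-p\,[\,C\Sigma B^T/(B\Sigma B^T) - (y - B\mu)C\mu/(B\Sigma B^T) - (y - B\mu)^2 C\Sigma B^T/(B\Sigma B^T)^2\,]$, obtained by differentiating the univariate normal density of $y = Bx$, and compares it with a central difference quotient. All function names and the test values are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def projected_density(y, B, mu, cov):
    """Density of y = B x when x ~ N(mu, cov) and B is a 1 x n vector."""
    mean = dot(B, mu)
    var = dot(B, mat_vec(cov, B))
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def lemma_differential(y, B, C, mu, cov):
    """Closed-form Gateaux differential of the projected density with increment C."""
    var = dot(B, mat_vec(cov, B))      # B Sigma B^T
    cross = dot(C, mat_vec(cov, B))    # C Sigma B^T
    resid = y - dot(B, mu)             # y - B mu
    bracket = cross / var - resid * dot(C, mu) / var - resid ** 2 * cross / var ** 2
    return -projected_density(y, B, mu, cov) * bracket

def central_difference(y, B, C, mu, cov, s=1e-5):
    plus = [b + s * c for b, c in zip(B, C)]
    minus = [b - s * c for b, c in zip(B, C)]
    return (projected_density(y, plus, mu, cov) - projected_density(y, minus, mu, cov)) / (2.0 * s)

mu, cov = [1.0, 2.0], [[2.0, 0.5], [0.5, 1.0]]
B, C, y = [1.0, -1.0], [0.4, 0.3], 0.7
print(lemma_differential(y, B, C, mu, cov), central_difference(y, B, C, mu, cov))
```

Agreement of the two printed values is a quick sanity check before the closed form is used inside a minimization routine.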

If $B$ is a nonzero $1 \times n$ vector which minimizes $G(B)$, then $B$
must satisfy the vector equation

$$\frac{\partial G(B)}{\partial B} = \begin{pmatrix} \delta G(B;C_1) \\ \vdots \\ \delta G(B;C_n) \end{pmatrix} = 0,$$

where $C_j$, $1 \le j \le n$, is a $1 \times n$ vector with a one in the $j$th slot
and zeros elsewhere. Using the above formula for $\delta G(B;C_j)$ resulting
from the previous THEOREM, we obtain a numerically tractable expression
for the variation in the probability of misclassification $G$ with
respect to $B$. The use of this expression in a computational procedure
for obtaining a nonzero $1 \times n$ $B$ which minimizes $G$ was developed by
Guseman and Marion^(15).

If $\hat{B}$ is a nonzero $1 \times n$ vector which minimizes $G$, then the
entries $p_{ij}(\hat{B})$ in the error matrix $P(\hat{B})$ for the optimal classification
procedure determined by the regions $R_i(\hat{B})$ can be readily computed from
the expression

$$p_{ij}(\hat{B}) = \int_{R_i(\hat{B})} p_j(y,\hat{B})\,dy, \quad i, j = 1, 2, \ldots, m.$$

The linear feature selection procedure for minimizing $G(B)$ has
been extended to the case where the density function for each class is a
convex combination of multivariate normals. This extension allows for the
design of a one-dimensional "class A--not class A" classification procedure
which could be used (for example) to classify wheat(s) vs. non-wheat(s).
The associated computational procedure for this extension was developed
by Guseman and Marion.

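Decell and Quirein^(6) develop explicit expressions for $\delta D(B;C)$ and
$\delta\rho(B;C)$ in terms of $B$ and the known means $\mu_i$ and covariance matrices
$\Sigma_i$, $1 \le i \le m$. These expressions immediately provide $\partial D(B)/\partial B$ and
$\partial\rho(B)/\partial B$ for use in a Davidon-Fletcher-Powell iteration scheme for
determination of an extremum value of $D(B)$ and $\rho(B)$, respectively.
The explicit expressions are:

$$\frac{\partial D(B)}{\partial B} = \sum_{i=1}^{m} (B\Sigma_i B^T)^{-1}\left[B S_i - (B S_i B^T)(B\Sigma_i B^T)^{-1} B\Sigma_i\right]$$

and

$$\frac{\partial \rho(B)}{\partial B} = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{\partial(\rho_B(i,j))}{\partial B},$$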


where

$$\frac{\partial(\rho_B(i,j))}{\partial B} = -\rho_B(i,j)\Big\{\tfrac{1}{2}\,[B(\Sigma_i+\Sigma_j)B^T]^{-1}\big[B\delta_{ij}\delta_{ij}^T - (B\delta_{ij}\delta_{ij}^T B^T)[B(\Sigma_i+\Sigma_j)B^T]^{-1} B(\Sigma_i+\Sigma_j)\big]$$
$$\qquad + [B(\Sigma_i+\Sigma_j)B^T]^{-1} B(\Sigma_i+\Sigma_j) - \tfrac{1}{2}\big[(B\Sigma_i B^T)^{-1} B\Sigma_i + (B\Sigma_j B^T)^{-1} B\Sigma_j\big]\Big\}.$$

It is also shown in Decell and Quirein^(6) that, in general, an absolute
extremum of $G(B)$, $\rho(B)$ and $D(B)$ always exists. For any one of the
given functions $G(B)$, $\rho(B)$ or $D(B)$ the absolute extremum is attained
at $B = (I_k|Z)U$ for some unitary matrix $U$, where $(I_k|Z)$ denotes the
$k \times n$ matrix consisting of the $k \times k$ identity followed by a $k \times (n-k)$
zero block, thus parameterizing the
aforementioned extreme problems on the compact group of unitary matrices.
In Brown and O'Malley it is shown that the nature of the eigenvalues
of $U$ in no way provides any information about the extreme values of
$D((I_k|Z)U)$. In Decell and Smiley these results were refined in the
sense that any extremal transformation can be expressed in the form
$B = (I_k|Z)H_p \cdots H_1$, where $p \le \min\{k, n-k\}$ and $H_i$ is a Householder
transformation, $i = 1, \ldots, p$. The latter result suggests constructing a
sequence of transformations $(I_k|Z)H_1$, $(I_k|Z)H_2H_1$, $\ldots$ such that the
values of the class separability criterion (e.g. $G(B)$, $\rho(B)$, $D(B)$)
evaluated at this sequence form a bounded, monotone sequence of real
numbers. The construction of the $i$th element of the sequence of
transformations requires the solution of an n-dimensional optimization problem.
Recall that $T(H)$, the Householder transformations (see Householder)
$H = I - 2xx^T$, $x \in R^n$ with $\|x\| = 1$, is a compact connected subset
of the unitary matrices for which $H^T = H = H^{-1}$. We outline some of
these results beginning with the definition (for a case, say, when we
wish to maximize $\Phi$):

$$\Phi((I_k|Z)H_1) = \max_{H \in T(H)} \Phi((I_k|Z)H).$$
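The properties of Householder transformations recalled above ($H^T = H = H^{-1}$) can be confirmed numerically, together with the fact that $(I_k|Z)H$ is simply the first $k$ rows of $H$ when $Z$ is a zero block. The sketch below uses plain nested lists; the function names are assumptions made for illustration.

```python
import math

def householder(x):
    """H = I - 2 u u^T with u = x / ||x||."""
    norm = math.sqrt(sum(v * v for v in x))
    u = [v / norm for v in x]
    n = len(u)
    return [[(1.0 if r == c else 0.0) - 2.0 * u[r] * u[c] for c in range(n)]
            for r in range(n)]

def matmul(A, B):
    return [[sum(A[r][t] * B[t][c] for t in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def reduction_matrix(H, k):
    """(I_k | Z) H, i.e. the first k rows of H (Z is taken to be a zero block)."""
    return [row[:] for row in H[:k]]

H = householder([1.0, 2.0, 2.0])
B = reduction_matrix(H, 2)
print(matmul(H, H))   # numerically the identity, so H is its own inverse
print(B)              # a 2 x 3 matrix with orthonormal rows
```

Because the rows of `B` are orthonormal, every matrix produced this way has rank k, as required of a reduction matrix.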

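THEOREM. For each positive integer $i$, let the element $H_i$ of $T(H)$
be chosen such that

$$\Phi((I_k|Z)H_i H_{i-1} \cdots H_1) = \max_{H \in T(H)} \Phi((I_k|Z)H H_{i-1} \cdots H_1);$$

then,

(1) $\Phi((I_k|Z)H_i \cdots H_{p+2}\,H\,H_{p+1} \cdots H_1) \le \Phi((I_k|Z)H_{i+1} H_i \cdots H_1)$
(2) $\Phi((I_k|Z)H_i \cdots H_1 H) \le \Phi((I_k|Z)H_{i+1} H_i \cdots H_1)$
(3) $\Phi((I_k|Z)H H_i \cdots H_1) \le \Phi((I_k|Z)H_{i+1} H_i \cdots H_1)$
(4) $\Phi((I_k|Z)H_i \cdots H_1) \le \Phi((I_k|Z)H_{i+1} H_i \cdots H_1)$

for every $H \in T(H)$ and $p = 0, \ldots, i-2$.

THEOREM. The sequence $\{\Phi((I_k|Z)H_i \cdots H_1)\}_{i=1}^{\infty}$ is bounded above and

$$\lim_{i \to \infty} \Phi((I_k|Z)H_i \cdots H_1) = \mathrm{l.u.b.}\,\{\Phi((I_k|Z)H_i \cdots H_1)\}.$$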

These theorems give rise to a sequential monotone procedure for possibly
obtaining a $\Phi$-extremal rank $k$ linear combination matrix. At each stage
in this procedure, the extremal problem is a function of only $n$ variables.
We conjecture, under certain conditions, that the process should terminate
in at most $\min\{k, n-k\}$ steps. The conjecture is clearly in line with the
$\min\{k, n-k\}$ representation of the actual $\Phi$-extremal solution. Certainly
the conjecture further depends on perhaps some pathological behavior of
$\Phi$, and Tally constructs such a pathological failure point. Talley
shows that the procedure actually converges to a $\Phi$-extremum provided $\Phi$
is T(H)-sloped. We will outline some of these results. Let $\mathcal{U}$ denote
the set of unitary matrices and $T(H)$ the Householder transformations.

DEFINITION. $\Phi$ will be called T(H)-sloped provided $U \in \mathcal{U}$ and
$\Phi(U) < \Phi_{\max}$ imply there exists some $H \in T(H)$ (dependent on $U$) such
that $\Phi(U) < \Phi(HU) \le \Phi_{\max} = \mathrm{l.u.b.}_{U \in \mathcal{U}}\,\Phi(U)$.

DEFINITION. A sequence $\{U_i\}_{i=1}^{\infty}$ in $\mathcal{U}$ will be called $\Phi$-convergent
provided $\{\Phi(U_i)\}_{i=1}^{\infty}$ converges.

DEFINITION. A sequence $\{U_i\}_{i=1}^{\infty}$ in $\mathcal{U}$ will be called a $\Phi$-Householder
sequence provided $H \in T(H)$ and $i$ an integer imply

(1) $\Phi(U_i) \le \Phi(U_{i+1})$
(2) $\Phi(HU_i) \le \Phi(U_{i+1})$.

PROPOSITION. Each $\Phi$-Householder sequence $\{U_i\}_{i=1}^{\infty}$ is $\Phi$-convergent
and $\lim_i \Phi(U_i) = \Phi(U) = \mathrm{l.u.b.}_i\,\Phi(U_i)$ for some $U \in \mathcal{U}$.

PROPOSITION. Each $\Phi$-Householder sequence converges to $\Phi_{\max}$ if and
only if $\Phi$ is T(H)-sloped.

PROPOSITION. If $\{U_i\}_{i=1}^{\infty}$ is a $\Phi$-Householder sequence and $\Phi$ is
T(H)-sloped then exactly one of the following holds:

(1) $\{\Phi(U_i)\}$ is strictly monotonic (and convergent to $\Phi_{\max}$);
(2) for some integer $k$, $\mathrm{l.u.b.}_{H \in T(H)}\,\Phi(HU_k) \le \Phi(U_k)$ (in which case
$\Phi(U_k) = \Phi_{\max}$).
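The sequential procedure described above can be sketched in a few lines if the exact maximization over $T(H)$ at each stage is replaced by a crude random search. That replacement, the function names, and the choice of criterion (the two-class B-divergence with k = 1) are all assumptions made for illustration; this is not the computational procedure of the cited reports. Each stage also tries the Householder matrix built from the last coordinate axis, which leaves the first k rows unchanged when k < n, so the recorded criterion values cannot decrease.

```python
import math
import random

def householder(x):
    """H = I - 2 u u^T with u = x / ||x||."""
    norm = math.sqrt(sum(v * v for v in x))
    u = [v / norm for v in x]
    n = len(u)
    return [[(1.0 if r == c else 0.0) - 2.0 * u[r] * u[c] for c in range(n)]
            for r in range(n)]

def matmul(A, B):
    return [[sum(A[r][t] * B[t][c] for t in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def divergence_row(b, mu1, cov1, mu2, cov2):
    """Two-class B-divergence when B is the single row b (k = 1)."""
    n = len(b)
    v1 = sum(b[r] * cov1[r][c] * b[c] for r in range(n) for c in range(n))
    v2 = sum(b[r] * cov2[r][c] * b[c] for r in range(n) for c in range(n))
    d = sum(b[r] * (mu1[r] - mu2[r]) for r in range(n))
    return 0.5 * (v1 - v2) * (1.0 / v2 - 1.0 / v1) + 0.5 * (1.0 / v1 + 1.0 / v2) * d * d

def sequential_householder(phi, n, k, stages, trials, rng):
    """Greedy sketch: each stage picks, among candidates, the H maximizing phi((I_k|Z) H U)."""
    U = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    values = []
    for _ in range(stages):
        # Last coordinate axis first: it reproduces the current value exactly.
        candidates = [[0.0] * (n - 1) + [1.0]]
        candidates += [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(trials)]
        best_U, best_val = None, -math.inf
        for x in candidates:
            trial_U = matmul(householder(x), U)
            val = phi(trial_U[:k])
            if val > best_val:
                best_U, best_val = trial_U, val
        U = best_U
        values.append(best_val)
    return U, values

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mu1, mu2 = [0.0, 0.0, 0.0], [0.0, 0.0, 3.0]
phi = lambda B: divergence_row(B[0], mu1, identity, mu2, identity)
U, values = sequential_householder(phi, 3, 1, 4, 200, random.Random(0))
print(values)   # a nondecreasing sequence of criterion values
```

In this toy problem the classes differ only along the third coordinate, so the criterion is bounded by the full-space divergence and the search should move the single retained row toward that axis.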


These techniques have been applied to the functions $\Phi(B) = D(B)$
and $\Phi(B) = \rho(B)$, respectively, by Decell and Mayekar and Decell and
Marani^(25) using C1 flight line data.

In each case explicit expressions for $\frac{\partial}{\partial x}[D((I_k|Z)H)]$ and
$\frac{\partial}{\partial x}[\rho((I_k|Z)H)]$, where $H = I - 2xx^T$, $\|x\| = 1$, have been developed for
the $m$ pattern class case and used sequentially, according
to the aforementioned theorems, to calculate the extreme values and the
unitary matrices (as products of elements of $T(H)$) at which the extreme
values occur. Some of the results are outlined in what follows.

Let

$$\Sigma_{ij} = \Sigma_i + \Sigma_j$$

and

$$L_{ij} = \Sigma_j H (I_k|Z)^T \left[(I_k|Z) H \Sigma_j H (I_k|Z)^T\right]^{-1},$$

and let

$$\hat{Q}_{ij} = \left(xx^T Q_{ij} (I_k|Z) - Q_{ij} (I_k|Z)\,xx^T\right)^T - \left(xx^T Q_{ij} (I_k|Z) - Q_{ij} (I_k|Z)\,xx^T\right).$$

Let $\hat{J}_{ij}$, $\hat{K}_{ij}$, and $\hat{L}_{ij}$ be similarly defined by substituting,
respectively, $J_{ij}$, $K_{ij}$, and $L_{ij}$ for $Q_{ij}$ in the expression for $\hat{Q}_{ij}$,
$i, j = 1, \ldots, m$. The resulting expressions are:

m-1


k W(lkl2)H)] ■ i i 


^ iti j-Ul (x^x? 


m 

I 




where 

■ (p^-Uj)tu,-|ij)^ . tr(-) • trace of (■) and |-| = det(-) 


= - j In (IJZ)HE,jH(I^lZ)^| + 1 lnl(IJZ)Hi;,H(I|^|Z)^ 

+ I In (IJZ)HZjH(I|^|Z)'^ ♦ I InZ . 

|r CD((I|,|Z)H)] - — ^ " {(M,-Nj^-(M,-N,))x 

9x k 1.1 ’ ’ ’ ' 

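where

$$M_i = xx^T Q_i (I_k|Z),$$

$$N_i = Q_i (I_k|Z)\,xx^T,$$

$$Q_i = \left[\left(S_i B^T - \Sigma_i B^T (B\Sigma_i B^T)^{-1} (B S_i B^T)\right)(B\Sigma_i B^T)^{-1}\right],$$

$$B = (I_k|Z)(I - 2xx^T).$$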

Peters, Redner and Decell approach the problem of finding a
minimum of $G(B)$ from the point of view of treating the mapping $B : R^n \to R^k$
(for some $k < n$) as a statistic and provide necessary and sufficient
conditions that such a $B$ be a sufficient statistic in the classical
sense of Halmos and Savage^(27), Lehmann and Scheffé^(28), Bahadur^(29),
LeCam^(30) and Kullback^(5). Although their results are much more general
than required for dealing with the dimension reduction problem for a
finite number of normal populations, the application they provide for
such families actually allows one to write down the optimal dimension reducing
$k \times n$ statistic $B$ such that $G(B) = G$ (whenever such a $B$ exists).
Moreover, they also guarantee that there is no other $B$ of smaller rank
(i.e. $< k$) for which $G(B) = G$.

We will simply state their application to the problem and refer the
reader to Peters, Decell and Redner^(26) for the more general applications
to exponential families (e.g. Wishart and normal multivariate sampling).

Let $N(\mu_i, \Sigma_i)$, $i = 0, 1, \ldots, m-1$, be an n-variate normal family with
$\mu_0 = 0$ and $\Sigma_0 = I$ having densities

$$p_i(x) = (2\pi)^{-n/2}\,|\Sigma_i|^{-1/2} \exp\left[-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right].$$

The requirement $\mu_0 = 0$ and $\Sigma_0 = I$ imposes no loss of generality since
there exists a nonsingular matrix $M_0$ for which $M_0 \Sigma_0 M_0^T = I$ and a
change of coordinate system defined by the transformation $x \to M_0(x - \mu_0)$
allows one to recover the sufficient statistic in the original coordinate
system.

THEOREM. Let $\mu_0 = 0$, $\Sigma_0 = I$ and
$M = [\mu_1 \,|\, \mu_2 \,|\, \cdots \,|\, \mu_{m-1} \,|\, \Sigma_1 - I \,|\, \cdots \,|\, \Sigma_{m-1} - I]$.
$B$ is a linear sufficient statistic for the given finite
n-variate normal family if and only if range($B^T$) = range($M$). Moreover,
$k$ = rank $M$ is the smallest integer for which there exists a $k \times n$
sufficient statistic for the given family.

Again, note that this theorem completely determines the smallest
dimension reduction possible such that $G(B) = G$. Moreover, as we
will show by example in what follows, there are any number of ways of
finding a $B$ such that range($B^T$) = range($M$). In fact, the
theorem states that if rank $M = n$ then there is no dimension reducing
sufficient statistic (i.e. $G(B) > G$ for every $k \times n$ matrix $B$ for
which $k < n$).

The following result due to Decell, Odell and Coberly^(31) provides
one means of calculating (and determining the existence of) the
aforementioned sufficient statistic $B$ for which $G(B) = G$.


THEOREM. Let $\Pi_i$ be an n-variate normal population with a priori
probability $\alpha_i > 0$, mean $\mu_i$ and covariance $\Sigma_i$, $i = 0, 1, \ldots, m-1$
(with $\mu_0 = 0$, $\Sigma_0 = I$) and let
$FG = M = [\mu_1 \,|\, \mu_2 \,|\, \cdots \,|\, \mu_{m-1} \,|\, \Sigma_1 - I \,|\, \cdots \,|\, \Sigma_{m-1} - I]$
be a full rank ($= k < n$) decomposition of $M$. Then, the
n-variate Bayes procedure assigns $x$ to $\Pi_i$ if and only if the
k-variate Bayes procedure assigns $F^T x$ to $\Pi_i$. Moreover, $k$ is
the smallest integer for which there exists a $k \times n$ matrix $T$
preserving the Bayes assignment of $x$ and $Tx$ to $\Pi_i$; $i = 0, 1, \ldots, m-1$.
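A small numerical sketch may help fix ideas. It builds the columns of $M$ for two classes in $R^3$, finds its rank together with a maximal independent set of columns (one possible choice of $F$), and checks on a grid that the Bayes assignment of $x$ agrees with that of $F^T x$. The function names, the Gram-Schmidt rank test, and the restriction of the classifier to diagonal covariances and to k = 1 are assumptions made to keep the example short.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_M(means, covs):
    """Columns of M for classes 1..m-1; class 0 is assumed to be N(0, I)."""
    n = len(means[0])
    cols = [list(mu) for mu in means]
    for cov in covs:
        for c in range(n):
            cols.append([cov[r][c] - (1.0 if r == c else 0.0) for r in range(n)])
    return cols

def rank_and_pivots(columns, tol=1e-10):
    """Rank of the column set and indices of a maximal linearly independent subset."""
    basis, pivots = [], []
    for idx, col in enumerate(columns):
        v = list(col)
        for b in basis:  # remove components along the orthonormal basis found so far
            coeff = dot(v, b)
            v = [x - coeff * y for x, y in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > tol:
            basis.append([x / norm for x in v])
            pivots.append(idx)
    return len(basis), pivots

def label_full(x, priors, means, diag_vars):
    """Bayes label in R^n for classes with diagonal covariances."""
    scores = []
    for a, mu, dv in zip(priors, means, diag_vars):
        log_p = sum(-0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
                    for xi, m, v in zip(x, mu, dv))
        scores.append(math.log(a) + log_p)
    return scores.index(max(scores))

def label_projected(x, t, priors, means, diag_vars):
    """Bayes label of y = t x, class i being N(t mu_i, t Sigma_i t^T) in one dimension."""
    y = dot(t, x)
    scores = []
    for a, mu, dv in zip(priors, means, diag_vars):
        mean = dot(t, mu)
        var = sum(ti * ti * v for ti, v in zip(t, dv))
        scores.append(math.log(a) - 0.5 * math.log(2.0 * math.pi * var)
                      - (y - mean) ** 2 / (2.0 * var))
    return scores.index(max(scores))

# Class 0 is N(0, I); class 1 differs from it only along the first coordinate.
means = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
diag_vars = [[1.0, 1.0, 1.0], [4.0, 1.0, 1.0]]
cov1 = [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
M = build_M(means[1:], [cov1])
k, pivots = rank_and_pivots(M)
t = M[pivots[0]]   # with k = 1, F is a single column of M and T = F^T
print(k, t)
```

Here the rank is one, so a single linear feature carries all of the information the Bayes classifier uses; for a general problem the same rank computation indicates whether any reduction without loss is possible.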

These results completely characterize the nature of data compression
for the Bayes classification procedure for normal classes in the sense that
$k$ is the smallest allowable data compression dimension consistent with
preserving Bayes population assignment. Moreover, the theorem provides
an explicit expression for the compression matrix $T$ that depends only
upon the known population means and covariances. The statistic $T = F^T$
given by the THEOREM is by no means unique (e.g. for any nonsingular
$k \times k$ matrix $A$, $T = AF^T$ will do). It is also true that there may be
more efficient methods for calculating the statistic $T$ (yet to be
determined) than the method of full rank decomposition of $M$.

It should be noted that the matrix $M$ has an "excellent chance"
of having rank equal to $n$. Even in the case of two populations
($m = 2$), there may well be $n$ linearly independent columns among the
2(n+1) columns of $M$ and, therefore, no integer $k < n$ and $k \times n$ rank
$k$ compression matrix $T$ preserving the Bayes assignment of $x$ and $Tx$.

Peters^(32) treats the problem of determining sufficient statistics
for mixtures of probability measures in a homogeneous family. We
refer the reader to Teicher^(33),(34) and Yakowitz^(35),(36) for the
treatment of this rather profound subject.

The linear feature selection techniques mentioned above, when used in
a LACIE type application, are based on the assumption that each class
conditional density function is multivariate normal and that the associated
parameters ($\mu_i$, $\Sigma_i$, $1 \le i \le m$) are known or can be estimated. In some
cases either the normality assumptions may be violated or else the
determination of the number of classes present and their associated
parameters is not possible. The question then arises as to how one might perform
a dimensionality reduction without losing much of the "separation" present
in measurement space. For example, one might be interested in displaying
a registered multipass LANDSAT data set on a three color display device
without a priori knowledge of class structure in the data.

Each of the previous linear feature selection techniques uses a
statistical definition of the word separation. The following procedure,
due to Bryant and Guseman^(7), makes no statistical assumptions about the
data. In addition, no labelled subsets (training data) are required.
In this sense the linear feature selection technique outlined below is
distribution free.

Basically the problem can be stated as follows:

Given distinct (prototype) vectors $x_1, x_2, \ldots, x_p$ in $R^n$, and
$k$, $1 \le k < n$, determine a linear transformation $A : R^n \to R^k$ which
minimizes

$$F(A) = \sum_{1 \le i < j \le p} \left(\|x_i - x_j\| - \|Ax_i - Ax_j\|\right)^2,$$

where the norms $\|x_i - x_j\|$ and $\|Ax_i - Ax_j\|$ are the Euclidean norms
in $R^n$ and $R^k$, respectively. Let $m = p(p-1)/2$ and let
$\{z_i : 1 \le i \le m\}$ denote the $m$ distinct differences of the prototypes
$x_i - x_j$. If $A = (a_{ij})_{k \times n}$ and $z_i = (z_{i1}, \ldots, z_{in})^T$, then
the gradient of $F$ at $A$ is given by

$$\frac{\partial F}{\partial A} = 2\,[AS - AT(A)],$$

where $S$ is the $n \times n$ matrix

$$S = \sum_{i=1}^{m} z_i z_i^T$$

and $T(A)$ is the $n \times n$ matrix

$$T(A) = \sum_{i=1}^{m} \frac{\|z_i\|}{\|Az_i\|}\,z_i z_i^T.$$

Standard optimization techniques can be used to obtain $A$ which
minimizes $F$.

For a given data set (e.g. a LANDSAT sample segment) there are several
ways to choose the prototype vectors $x_i$, $1 \le i \le p$. For example,
one might choose cluster centers from the output of a clustering
algorithm.
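The distance-preservation criterion and its gradient are simple enough to code directly. The sketch below evaluates $F(A)$, forms the gradient as $2(AS - AT(A))$ with $S = \sum z z^T$ and $T(A) = \sum (\|z\|/\|Az\|)\,z z^T$ over the prototype differences $z$, and takes one small gradient step. The prototype values, the starting matrix, the step size and all function names are illustrative assumptions; any standard optimizer could replace the single step.

```python
import math

def mat_vec(A, z):
    return [sum(a * zi for a, zi in zip(row, z)) for row in A]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def differences(points):
    """All pairwise differences x_i - x_j with i < j."""
    return [[a - b for a, b in zip(points[i], points[j])]
            for i in range(len(points)) for j in range(i + 1, len(points))]

def stress(A, zs):
    """F(A): sum over differences z of (||z|| - ||A z||)^2."""
    return sum((norm(z) - norm(mat_vec(A, z))) ** 2 for z in zs)

def stress_gradient(A, zs):
    """2 (A S - A T(A)); requires A z != 0 for every difference z."""
    k, n = len(A), len(A[0])
    W = [[0.0] * n for _ in range(n)]   # accumulates S - T(A)
    for z in zs:
        weight = 1.0 - norm(z) / norm(mat_vec(A, z))
        for r in range(n):
            for c in range(n):
                W[r][c] += weight * z[r] * z[c]
    return [[2.0 * sum(A[i][r] * W[r][c] for r in range(n)) for c in range(n)]
            for i in range(k)]

points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0],
          [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
A = [[1.0, 0.0, 0.25], [0.0, 1.0, -0.5]]
zs = differences(points)
g = stress_gradient(A, zs)
step = 1e-3
A_next = [[A[i][c] - step * g[i][c] for c in range(3)] for i in range(2)]
print(stress(A, zs), stress(A_next, zs))   # the second value should be smaller
```

The gradient is undefined when some difference is mapped to zero, so a practical implementation would guard against $\|Az\| = 0$ before dividing.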


CONCLUDING REMARKS 

There are, of course, ad hoc feature selection procedures based
upon specific problem knowledge and empirical studies. An example of
such a procedure is the transformation of Kauth and Thomas^(37) used in
the analysis of LANDSAT data. This transformation is based upon an
empirical data study and is described by an orthogonal coordinate change
$U : R^4 \to R^4$. Application of the transform $U$ to LANDSAT measurements
simply produces a reduced feature space of dimension 2 (Brightness-Greenness).
This is essentially accomplished at each LANDSAT measurement
$x = (x_1, x_2, x_3, x_4)^T$ by mapping $x$ to its Brightness and Greenness
coordinates under $U$.

The Kauth-Thomas transform has proven to be of value in LACIE
applications (e.g. physical interpretation, dimension reduction, scatter
plots, etc.). As one would expect, the Kauth-Thomas transform is not a
sufficient statistic nor will it, in general, preserve LANDSAT Bayes
class assignment in feature space.

Feature selection techniques are currently being studied as a tool
for "optimum pass" selection problems in LACIE. The basic objective is
to develop a technique for a priori selection (based on some separability
criterion) of subsets of LANDSAT acquisitions for analysis to separate
wheat from nonwheat when given an adequate sample of labelled wheat and
nonwheat LACIE segment pixel data. There are preliminary results in this
direction due to Guseman and Marion^(38) using one-dimensional feature
selection which minimizes $G(B)$.

In still another LACIE application, studies are being performed on
parametric and nonparametric feature selection techniques that allow
analyst/interpreters to better separate spring wheat from other small
grains in a reduced feature space (e.g. Brightness-Greenness). In
this connection, labelled wheat and other small grain LACIE segment
pixel data and ancillary data are being used to estimate the distribution
functions for spring wheat and other spring small grains. Feature
selection methods are being used to find a priori statistically optimum
features and associated discriminant functions. These will be compared
to the brightness and greenness features currently used by NASA/JSC.

Methods for estimating class proportions, based on the linear feature
selection procedure for minimizing $G(B)$, have been developed by Guseman
and Walton^(39),(40). In both papers, the proportion estimation techniques
rely on the fact that one can readily compute the error matrices associated
with the optimal classifier produced by the linear feature selection
procedure.

Other results of general related interest appear in Babu and Kalra^(41),
Kadota and Shepp^(42), Marill and Green^(43), Swain and King^(44), Tou and
Heydorn^(45), Watanabe^(46), Wee^(47), and Wheeler, Misra and



